Abstract Body


Background / Context: 

Recent research on multiple measures of teaching effectiveness has redefined the role of
in-classroom observations in teacher evaluation systems. In particular, most states now mandate that
teachers be observed on multiple occasions during the school year, and it is increasingly common
for multiple raters to be used across the different rating occasions (White, 2014). Teacher
observations also continue to carry the majority of the weight in many districts’ evaluation systems.
For example, the NYC DOE currently allocates 60% of the weight in teacher evaluations to between
four and six observations made throughout the school year, with the remaining 40% split
between value-added measures and other local criteria.

In-classroom observations are typically conducted using a rating rubric, with the Danielson 
Framework (Danielson, 2013) being one prominent example. Despite the growing evidence that 
rating rubrics can provide useful information about teaching practices (Kane & Staiger, 2012), it 
remains less clear how that information should be summarized to support consequential inferences 
about individual teachers. Researchers in the teacher evaluation literature have summarized the 
rubrics with a total score, which has led to g-theory studies of the reliability of these scores over 
multiple rating occasions and raters (Ho & Kane, 2013). However, related research has found that 
many rubrics measure multiple dimensions of instructional quality (Grossman et al., 2014; Kane 
& Staiger, 2012; Savitsky & McCaffrey, 2014), suggesting that teachers’ practices are not well
described in terms of a total score. Halpin & Kieffer (2015) argued for the use of latent class analysis
(LCA) as a means of capturing the multidimensional features of rating rubrics, while also
providing the standard error of measurement for each teacher, and item-level diagnostic information that
can be used as the basis of feedback to educators and for professional development. However, the 
analysis of Halpin & Kieffer was conducted using teacher-aggregate data and therefore did not 
allow for a model-based investigation of reliability over multiple rating occasions and raters. 

Purpose / Objective / Research Question / Focus of Study: 

The main purpose of the present research is to develop a multilevel extension of the LCA
methodology described by Halpin & Kieffer (2015). For a given rating rubric, the multilevel LCA
approach is specifically intended to answer the following questions: (a) How reliably (precisely) is
a teacher’s teaching ability measured during any single observation session? (b) How consistently
does a teacher perform over observation sessions? (c) For a given teacher, how many observation
sessions are required before his/her teaching ability has been measured with a desired level of
precision? The last question in particular has relevance for policy, in that multi-rater systems can place
heavy financial demands on school districts in terms of deploying a sufficient number of trained
raters to meet the required number of observation sessions per teacher. The proposed methodology
allows for decisions about the required number of observations to be made on a teacher-by-teacher
basis, and to be informed by the data collected from each teacher.

An additional purpose of this research is to provide a satisfactory solution to the problem of rater
reliability within the multilevel LCA framework. At the time of this writing, this is a continuing 
area of research. The specific topics to be addressed are (a) how to control for the effects of 
raters when making inferences about individual teachers, (b) requirements on how raters should 
be deployed to teachers in order for the rater effects to be identified, and (c) methods for inferring 
whether a rater is performing within expectation for a given population of teachers. Initial work is 
outlined in the description of the model below. 

Significance / Novelty of Study: 

Teacher observations are taking an increasingly prominent role in teacher evaluation systems. 
Previous research on the reliability of teacher observations has employed a true-score model of 
teaching ability in combination with generalizability theory (Kane & Staiger, 2012; Ho & Kane,
2013). While leading to significant advances in the practice of teacher observations, this work can
also be recognized as having a number of shortcomings. In particular, related research has made 
it apparent that many rating rubrics tend to be multidimensional (Grossman et al., 2014; Kane 
& Staiger, 2012; Savitsky & McCaffrey, 2014), and hence the utility of a true-score approach to 
reliability is brought into question. Halpin & Kieffer (2015) proposed an approach based on LCA 
that is compatible with the evidence that teachers’ practices are multidimensional, but this did not 
address the reliability of scores over observation sessions. The present work fills this void, while 
working to address the role that raters play in generating the session-level data. 

Statistical, Measurement, or Econometric Model: 

This section sketches the basic details of how multilevel LCA can be applied as a measurement
model for teacher observations. Estimation of multilevel LCA via a multinomial regression
specification using the EM algorithm is described by Vermunt (2003, 2004), and this can be applied to
the present application. However, the proposed approach to rater effects requires embedding
integration over raters in the E-step, which is accomplished using the method described by Hedeker
(2003).

The model. Let $X_{ijk}$ denote the rating assigned to item $k$ of a rating rubric, observed during
session $j$ of teacher $i$. It is assumed that each $X_{ijk}$ is a random variable with support
$x = \{x_r \mid r = 1, \ldots, R\}$. Let $X_{ij} = (X_{ij1}, X_{ij2}, \ldots, X_{ijK})$ represent the
$K$-vector of ratings for the $ij$-th session and let $X_i = \{X_{i1}, X_{i2}, \ldots, X_{in_i}\}$ denote
the collection of $n_i$ sessions for teacher $i = 1, \ldots, N$.

At the teacher level, introduce a discrete latent variable $Z_i$ with support $z = \{z_t \mid t = 1, \ldots, T\}$
and assume that the joint probability mass function (pmf) $P(X_i, Z_i)$ is well-defined. Then the pmf
of the observations of teacher $i$ is

$$P(X_i) = \sum_{t=1}^{T} P(X_i \mid Z_i = z_t)\, P(Z_i = z_t). \qquad (1)$$

The basic idea behind the teacher-level model is to select the minimum value of $T$ such that

$$P(X_i \mid Z_i = z_t) = \prod_{j=1}^{n_i} P(X_{ij} \mid Z_i = z_t) \qquad (2)$$

for all $i$. The interpretation of $Z_i$ is discussed. Any additional clustering of observation sessions
within teachers (due to groups of students, subject taught, specific lessons, time of day, etc.) is
considered uninteresting and addressed using established methods for model misspecification (e.g.,
cluster-robust standard errors).

Similar to the teacher level, introduce a discrete latent variable $Y_{ij}$ with support
$y = \{y_s \mid s = 1, \ldots, S\}$ at the session level and assume that $P(X_{ij}, Y_{ij}, Z_i)$ is
well-defined. Then

$$P(X_{ij} \mid Z_i = z_t) = \sum_{s=1}^{S} P(X_{ij} \mid Y_{ij} = y_s, Z_i = z_t)\, P(Y_{ij} = y_s \mid Z_i = z_t). \qquad (3)$$

It is assumed that the measurement model is invariant with respect to $Z_i$,

$$P(X_{ij} \mid Y_{ij}, Z_i) = P(X_{ij} \mid Y_{ij}), \qquad (4)$$

and that the item responses are independent given $Y_{ij}$,

$$P(X_{ij} \mid Y_{ij}) = \prod_{k=1}^{K} P(X_{ijk} \mid Y_{ij}). \qquad (5)$$

These standard psychometric assumptions and their alternatives are discussed. 

Substituting (2) through (5) into (1) gives the basic model:

$$P(X_i) = \sum_{t=1}^{T} P(Z_i = z_t) \prod_{j=1}^{n_i} \sum_{s=1}^{S} P(Y_{ij} = y_s \mid Z_i = z_t) \prod_{k=1}^{K} P(X_{ijk} \mid Y_{ij} = y_s). \qquad (6)$$
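As a rough illustration of how equation (6) can be evaluated for a single teacher, the following Python sketch computes $P(X_i)$ from the three sets of model probabilities. The function names, array layouts, and toy parameter values are assumptions made for this illustration only and are not taken from the analysis; rating categories are coded 0, ..., R-1.

```python
import itertools
import numpy as np

def session_likelihood(x, item_probs, session_given_teacher):
    """P(X_ij | Z_i = z_t) for every teacher class t, following eqs. (3)-(5).

    x: length-K integer array of rating categories for one session.
    item_probs: shape (S, K, R); entry [s, k, r] = P(X_ijk = x_r | Y_ij = y_s).
    session_given_teacher: shape (T, S); entry [t, s] = P(Y_ij = y_s | Z_i = z_t).
    """
    K = len(x)
    # Eq. (5): product over items, giving one value per session class s.
    per_class = np.prod(item_probs[:, np.arange(K), x], axis=1)  # shape (S,)
    # Eq. (3) under the invariance assumption (4): sum over session classes.
    return session_given_teacher @ per_class                     # shape (T,)

def teacher_likelihood(X, item_probs, session_given_teacher, teacher_prior):
    """P(X_i) from eq. (6); X has shape (n_i, K)."""
    per_session = np.array(
        [session_likelihood(x, item_probs, session_given_teacher) for x in X]
    )                                                            # shape (n_i, T)
    # Eq. (2): product over sessions; eq. (1): sum over teacher classes.
    return float(teacher_prior @ np.prod(per_session, axis=0))

# Made-up parameters: T = 2 teacher classes, S = 2 session classes,
# K = 2 items, R = 2 rating categories.
item_probs = np.array([[[0.8, 0.2], [0.7, 0.3]],
                       [[0.1, 0.9], [0.2, 0.8]]])
session_given_teacher = np.array([[0.9, 0.1],
                                  [0.3, 0.7]])
teacher_prior = np.array([0.6, 0.4])

# One hypothetical teacher observed in two sessions.
X_i = np.array([[0, 0], [1, 0]])
p_xi = teacher_likelihood(X_i, item_probs, session_given_teacher, teacher_prior)

# Sanity check: the pmf summed over all possible pairs of sessions should be 1.
all_sessions = list(itertools.product(range(2), repeat=2))
total = sum(
    teacher_likelihood(np.array(pair), item_probs, session_given_teacher, teacher_prior)
    for pair in itertools.product(all_sessions, repeat=2)
)
```

The nesting of the sums and products in the code mirrors the nesting in equation (6): items within session classes, session classes within sessions, and sessions within teacher classes.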

Classification of teachers. In standard applications, posterior analysis is via $P(Z_i \mid X_i)$ and it is
assumed that all observations $X_i$ arrive simultaneously. In professional settings, however, teachers
are observed on different occasions sequentially throughout the school year. Letting smaller values
of $j$ correspond to earlier observations and $X_{i\{j\}} = \{X_{i1}, X_{i2}, \ldots, X_{ij}\}$, this leads
to the following Bayesian updating scheme:

$$P(Z_i \mid X_{i\{j\}}) \propto P(X_{ij} \mid Z_i)\, P(Z_i \mid X_{i\{j-1\}}). \qquad (7)$$

The example illustrates how this approach leads to reliable classification of some teachers
using fewer observation sessions than for other teachers.
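The updating scheme in equation (7) can be sketched in a few lines of Python. The sketch takes the session likelihoods $P(X_{ij} \mid Z_i = z_t)$ as given inputs; the function names, the numbers, and in particular the stopping rule with a 0.9 threshold are hypothetical choices made for this illustration, not part of the proposed method.

```python
import numpy as np

def update_posterior(prior, session_lik):
    """One step of eq. (7): multiply the previous posterior by the
    likelihood of the new session, then renormalize over teacher classes."""
    unnorm = np.asarray(session_lik, dtype=float) * np.asarray(prior, dtype=float)
    return unnorm / unnorm.sum()

def sessions_until_classified(prior, session_liks, threshold=0.9):
    """Apply eq. (7) session by session and stop as soon as one teacher class
    has posterior probability >= threshold (a hypothetical stopping rule).

    Returns (number of sessions used, posterior at that point).
    """
    post = np.asarray(prior, dtype=float)
    for j, lik in enumerate(session_liks, start=1):
        post = update_posterior(post, lik)
        if post.max() >= threshold:
            return j, post
    # Threshold never reached: report the posterior after all sessions.
    return len(session_liks), post

# Made-up example with T = 2 teacher classes: every session is twice as
# likely under class 1 as under class 2.
prior = [0.5, 0.5]
session_liks = [[0.2, 0.1]] * 4
n_used, posterior = sessions_until_classified(prior, session_liks)
```

Under these made-up inputs the posterior for the first class moves from 1/2 to 2/3, 4/5, 8/9, and 16/17, so the threshold is first crossed at the fourth session. A teacher whose sessions discriminate more sharply between classes would cross it sooner, which is the sense in which the number of required sessions can differ across teachers.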

Rater effects. The choice of rater can influence the observed values recorded during an observation
session. Consequently, raters can be conceptualized in terms of violations of measurement
invariance:

$$P(X_{ijk} \mid Y_{ij}, W_{ij}) \neq P(X_{ijk} \mid Y_{ij}), \qquad (8)$$

where $W_{ij}$ denotes the rater of session $j$ of teacher $i$. It is assumed that the choice of rater influences
neither the type of teaching practice demonstrated during a session nor the type of teacher being
observed. Since differences among raters are a nuisance feature, it is advantageous to consider the
marginal probabilities

$$P(X_{ijk} \mid Y_{ij}) = \int P(X_{ijk} \mid Y_{ij}, u)\, f_W(u)\, du, \qquad (9)$$

which bring us back to equation (5). Implications of this modeling strategy are discussed, including
strategies for deployment of raters and how to evaluate bias among raters.
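To make the integral in equation (9) concrete, the following Python sketch evaluates it numerically under strong simplifying assumptions that are not stated in the text above: a binary item, a logistic link, and a rater severity $u$ that is normally distributed with mean zero and standard deviation `sigma`. Gauss-Hermite quadrature is one common way to approximate such integrals; the parameter names `beta_s` and `sigma` are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def marginal_item_prob(beta_s, sigma, n_nodes=21):
    """Approximate eq. (9) for a binary item under assumed forms:

        P(X_ijk = 1 | Y_ij = y_s, u) = logistic(beta_s - u),  u ~ N(0, sigma^2).

    beta_s is an assumed item location for session class s; u is the
    rater severity that is integrated out.
    """
    # Nodes and weights for integrals of the form  int exp(-x^2) g(x) dx.
    nodes, weights = hermgauss(n_nodes)
    # Change of variables u = sqrt(2) * sigma * x maps the N(0, sigma^2)
    # density onto the Gauss-Hermite weight function, up to 1/sqrt(pi).
    u = np.sqrt(2.0) * sigma * nodes
    conditional = 1.0 / (1.0 + np.exp(-(beta_s - u)))
    return float(np.sum(weights * conditional) / np.sqrt(np.pi))

# With no rater variability the marginal equals the conditional probability.
p_no_raters = marginal_item_prob(1.0, 0.0)
# With rater variability the marginal probability is pulled toward 0.5.
p_with_raters = marginal_item_prob(1.0, 2.0)
```

In this simplified setting, larger rater variability flattens the marginal item probabilities, so the items discriminate less sharply between session classes.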

Usefulness / Applicability of Method: 

The utility of the proposed method is illustrated with a secondary analysis of the Measures of
Effective Teaching (MET) longitudinal database (www.metproject.org), focusing on middle school
English language arts teachers in the first year of the study and using the PLATO rating rubric 
(Grossman et al., 2014). The accuracy with which teachers in the sample were classified is shown 
as a function of the number of observation sessions in Figure 1. The consistency with which 
teachers were classified is shown for a subset of nine teachers in Figure 2. These analyses are 
preliminary and further analyses are to follow. 

Conclusions 

Many school districts across the nation are taking seriously the call for multiple measures of
teacher effectiveness, and central to these efforts is an increased utilization of multi-rater
observation systems. This is a good example of the kind of translation between knowledge and practice that
is the theme of the present conference.

While multi-rater systems have been motivated by research on how to improve the reliability
of scores obtained from rating rubrics, much work remains to be done in terms of (a) reporting
the standard error of measurement when making consequential decisions about individual teachers’
proficiency levels, (b) using the information provided by instruments to make inferences about the
specific strengths and weaknesses of individual teachers’ practices, (c) obtaining optimal strategies
for rater deployment, both in terms of the total number of observation sessions required per teacher
and the identification of bias among raters, and (d) doing all of this in a statistical framework that is
compatible with the well-established multidimensionality of teacher observation data, yet feasible to
implement. The approach developed in this research addresses these issues and thereby facilitates
the continued improvement of multi-rater systems as they are currently used in practice.
Appendix B. Tables and Figures


Figure 1. Precision of Classification of Teachers as a Function of Number of Observations 
Figure 2. Consistency of Classification of N = 9 Teachers as a Function of Number of
Observations.


[Figure 2 image not reproduced. Recoverable panel titles: Teacher 1, Teacher 2, Teacher 3. Legend ("variable"): Z = 1, Z = 2, Z = 3, Z = 4.]




SREE Spring 2016 Conference Abstract Template 




